Generalized Linear Models (GLMs)#
Generalized Linear Models extend linear regression in two key ways:
Mean modeling via a link function
We don’t predict the mean \(\mu\) directly from a linear model. Instead we predict a linear predictor:
\[\eta = X\beta\]
and connect it to the mean via an inverse link \(g^{-1}\):
\[\mu = g^{-1}(\eta)\]
Squared loss is replaced by a distribution-aware loss
Instead of assuming Gaussian noise and minimizing squared error, we choose a distribution from the exponential family (more precisely: an exponential dispersion model) and minimize its unit deviance (plus regularization).
Learning goals#
understand what GLMs are (and why they exist)
connect: distribution ↔ link ↔ loss (deviance)
see how common models are just GLMs:
linear regression (Normal + identity)
logistic regression (Bernoulli + logit)
Poisson regression (Poisson + log)
Gamma / Inverse Gaussian regression (positive skew + log)
implement IRLS (iteratively reweighted least squares) for logistic and Poisson regression
map the ideas to scikit-learn’s GLM estimators
Table of contents#
Why GLMs? (what linear regression can’t do)
The GLM recipe (exponential family + link)
Deviance: “distribution-aware squared error”
PDFs/PMFs of common GLM distributions
Optimization intuition: IRLS (Newton as weighted least squares)
Logistic regression as a GLM (Bernoulli)
Poisson regression as a GLM (Poisson)
Gamma / Inverse Gaussian (via Tweedie)
Multiclass (categorical) GLM intuition
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from scipy.special import expit
from scipy.stats import norm, poisson, gamma as gamma_dist, invgauss
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, PoissonRegressor, GammaRegressor, TweedieRegressor
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(7)
1) Why GLMs?#
Linear regression predicts a real number:
\[\hat{y} = x^\top\beta \in \mathbb{R}\]
That’s perfect when \(y\) is continuous and (roughly) Gaussian.
But what if:
\(y\) is a probability (must be between 0 and 1)
\(y\) is a count (must be \(0,1,2,\dots\))
\(y\) is positive and skewed (e.g., time-to-failure, insurance claim size)
GLMs solve this by keeping the linear predictor (the “score”) but changing:
the mapping from score → mean (the link)
the loss (via the likelihood / deviance)
Analogy:
A linear model is like a “thermostat knob” (it can turn freely). GLMs add the right display and physics so the knob controls the correct kind of quantity (probability, rate, positive mean, …).
# Same linear score η, different inverse links μ = g⁻¹(η)
eta = np.linspace(-8, 8, 500)
mu_identity = eta
mu_sigmoid = expit(eta) # Bernoulli mean via logit link
mu_exp = np.exp(eta) # Poisson mean via log link
fig = go.Figure()
fig.add_trace(go.Scatter(x=eta, y=mu_identity, name="identity (Normal)", mode="lines"))
fig.add_trace(go.Scatter(x=eta, y=mu_sigmoid, name="sigmoid (Bernoulli/logit)", mode="lines"))
fig.add_trace(go.Scatter(x=eta, y=mu_exp, name="exp (Poisson/log)", mode="lines"))
fig.update_layout(
title="One linear score η = Xβ, many different means μ = g⁻¹(η)",
xaxis_title="η",
yaxis_title="μ",
width=900,
height=450,
)
fig.show()
2) The GLM recipe#
A GLM is defined by:
A distribution for \(y \mid x\) from an exponential family / EDM
A link function \(g\) connecting the mean to the linear predictor
Exponential family (one common form)#
A lot of useful distributions can be written in the form:
\[f(y;\theta,\phi) = \exp\!\left(\frac{y\theta - b(\theta)}{\phi} + c(y,\phi)\right)\]
where:
\(\theta\) is the natural parameter
\(\phi\) is a dispersion parameter
\(b(\theta)\) determines the mean and variance: \(\mu = b'(\theta)\) and \(\operatorname{Var}(y) = \phi\, b''(\theta)\)
In GLMs, the link \(g\) connects the mean to the linear predictor via \(\eta = g(\mu)\); the canonical link, which sets \(\eta = \theta\), is a particularly convenient and common choice.
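To make the canonical-link idea concrete, here is a quick numerical check (a sketch, not part of the main pipeline): for the Bernoulli family, \(b(\theta)=\log(1+e^{\theta})\), so the mean \(b'(\theta)\) is the sigmoid, and the logit is exactly the link that recovers \(\theta\) from \(\mu\).

```python
import numpy as np
from scipy.special import expit

# Bernoulli family: b(theta) = log(1 + e^theta).
theta = np.linspace(-5, 5, 201)
b = np.log1p(np.exp(theta))

# Finite-difference approximation of b'(theta): should match the sigmoid.
db = np.gradient(b, theta)
print(np.max(np.abs(db - expit(theta))))  # small

# The logit (canonical link) recovers theta from mu = b'(theta).
mu = expit(theta)
theta_back = np.log(mu / (1.0 - mu))
print(np.max(np.abs(theta_back - theta)))  # ~0
```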
3) Deviance: “distribution-aware squared error”#
For Gaussian noise, minimizing squared error is equivalent to maximizing a Gaussian likelihood.
For general GLMs, we maximize likelihood for the chosen distribution. In many GLM-style estimators, this is written using the unit deviance \(d(y, \mu)\).
A typical regularized objective is:
\[\min_\beta \; \frac{1}{n}\sum_{i=1}^{n} \frac{d(y_i, \mu_i)}{2} \;+\; \frac{\alpha}{2}\,\lVert\beta\rVert_2^2\]
If sample weights \(w_i\) are provided, the average becomes a weighted average.
Unit deviance cheat sheet (common EDMs)#
Below are standard unit deviances (with conventions like \(0\log 0 = 0\)).
| Distribution | Target domain | Unit deviance \(d(y,\mu)\) |
|---|---|---|
| Normal | \(\mathbb{R}\) | \((y-\mu)^2\) |
| Bernoulli | \(\{0,1\}\) | \(2\left[y\log\frac{y}{\mu} + (1-y)\log\frac{1-y}{1-\mu}\right]\) |
| Categorical (multinomial, 1 trial) | one-hot | \(2\sum_k y_k\log\frac{y_k}{\mu_k}\) |
| Poisson | \(\{0,1,2,\dots\}\) | \(2\left[y\log\frac{y}{\mu} - (y-\mu)\right]\) |
| Gamma | \(\mathbb{R}_{+}\) | \(2\left[\frac{y-\mu}{\mu} - \log\frac{y}{\mu}\right]\) |
| Inverse Gaussian | \(\mathbb{R}_{+}\) | \(\frac{(y-\mu)^2}{\mu^2 y}\) |
Interpretation:
Normal deviance is just squared distance.
Bernoulli deviance matches log-loss (up to a constant).
Poisson deviance compares counts in a way that respects positivity and relative error.
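These interpretations are easy to verify numerically. Below is a minimal sketch of three unit deviances from the cheat sheet (the helper names are ours, not from any library): the deviance vanishes at a perfect fit, is positive otherwise, and for Bernoulli targets equals twice the log-loss.

```python
import numpy as np

def dev_normal(y, mu):
    # Normal unit deviance: plain squared distance
    return (y - mu) ** 2

def dev_poisson(y, mu):
    # Poisson unit deviance, with the convention 0 * log 0 = 0
    yl = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    return 2.0 * (yl - (y - mu))

def dev_bernoulli(y, mu):
    # For y in {0, 1} this is exactly 2x the log-loss
    return -2.0 * (y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))

print(dev_poisson(5.0, 5.0))   # 0.0 at a perfect fit
print(dev_poisson(5.0, 3.0))   # positive otherwise
print(np.isclose(dev_bernoulli(1.0, 0.8), 2.0 * -np.log(0.8)))  # True
```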
# Visual intuition: unit deviance as a function of μ for fixed y
mus = np.linspace(1e-4, 0.9999, 500)
# Bernoulli deviance for y=1 and y=0
bern_y1 = 2.0 * np.log(1.0 / mus)
bern_y0 = 2.0 * np.log(1.0 / (1.0 - mus))
# Poisson deviance for y=5 (μ must be > 0)
mu_pos = np.linspace(1e-3, 15, 500)
y_p = 5.0
pois_dev = 2.0 * (y_p * np.log(y_p / mu_pos) - (y_p - mu_pos))
fig = make_subplots(rows=1, cols=2, subplot_titles=("Bernoulli unit deviance", "Poisson unit deviance (y=5)"))
fig.add_trace(go.Scatter(x=mus, y=bern_y1, mode="lines", name="y=1"), row=1, col=1)
fig.add_trace(go.Scatter(x=mus, y=bern_y0, mode="lines", name="y=0"), row=1, col=1)
fig.update_xaxes(title_text="μ", row=1, col=1)
fig.update_yaxes(title_text="d(y,μ)", row=1, col=1)
fig.add_trace(go.Scatter(x=mu_pos, y=pois_dev, mode="lines", name="y=5"), row=1, col=2)
fig.update_xaxes(title_text="μ", row=1, col=2)
fig.update_yaxes(title_text="d(y,μ)", row=1, col=2)
fig.update_layout(width=1000, height=420, title="Deviance = mismatch measure tailored to the distribution")
fig.show()
# PDFs/PMFs: a quick visual gallery of common GLM distributions
fig = make_subplots(
rows=2,
cols=3,
subplot_titles=(
"Normal (PDF)",
"Bernoulli (PMF)",
"Categorical (PMF)",
"Poisson (PMF)",
"Gamma (PDF)",
"Inverse Gaussian (PDF)",
),
)
# Normal
x = np.linspace(-4, 4, 400)
fig.add_trace(go.Scatter(x=x, y=norm.pdf(x, loc=0, scale=1), mode="lines"), row=1, col=1)
# Bernoulli
p = 0.3
fig.add_trace(go.Bar(x=[0, 1], y=[1 - p, p]), row=1, col=2)
# Categorical
probs = np.array([0.15, 0.25, 0.60])
fig.add_trace(go.Bar(x=["A", "B", "C"], y=probs), row=1, col=3)
# Poisson
lam = 4.0
k = np.arange(0, 16)
fig.add_trace(go.Bar(x=k, y=poisson.pmf(k, mu=lam)), row=2, col=1)
# Gamma
xg = np.linspace(1e-4, 10, 400)
shape = 2.0
scale = 1.0
fig.add_trace(go.Scatter(x=xg, y=gamma_dist.pdf(xg, a=shape, scale=scale), mode="lines"), row=2, col=2)
# Inverse Gaussian (SciPy parameterization)
xi = np.linspace(1e-4, 10, 400)
mu_param = 1.5
scale_param = 1.0
fig.add_trace(go.Scatter(x=xi, y=invgauss.pdf(xi, mu=mu_param, scale=scale_param), mode="lines"), row=2, col=3)
fig.update_layout(width=1100, height=650, title="Distribution shapes (PDF/PMF)")
fig.show()
4) Optimization intuition: IRLS#
For many GLMs (especially with canonical links), Newton’s method / Fisher scoring yields an update that looks like:
solve a weighted least squares problem, update \(\beta\), repeat
This is called IRLS (Iteratively Reweighted Least Squares).
A useful mental model:
a GLM is “like linear regression”, but the relationship between \(\eta\) and \(\mu\) is nonlinear
IRLS repeatedly builds a local quadratic approximation and solves a linear-looking problem
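In symbols (a standard form, stated here without derivation): with mean function \(\mu = g^{-1}(\eta)\) and variance function \(V(\mu)\), each IRLS step builds a working response and weights, then solves a weighted least-squares problem:
\[
z_i = \eta_i + (y_i - \mu_i)\,g'(\mu_i), \qquad
w_i = \frac{1}{V(\mu_i)\,g'(\mu_i)^2}, \qquad
\beta \leftarrow \left(X^\top W X\right)^{-1} X^\top W z
\]
For the logit link, \(g'(\mu) = 1/(\mu(1-\mu))\) and \(V(\mu) = \mu(1-\mu)\), so \(w_i = \mu_i(1-\mu_i)\); for the log link with Poisson variance, \(w_i = \mu_i\). Both cases appear in the implementations below.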
We’ll implement IRLS for:
Bernoulli + logit (logistic regression)
Poisson + log (Poisson regression)
def add_intercept(X: np.ndarray) -> np.ndarray:
X = np.asarray(X, dtype=float)
return np.c_[np.ones((X.shape[0], 1)), X]
def irls_logistic_regression(
X: np.ndarray,
y: np.ndarray,
alpha: float = 0.0,
max_iter: int = 50,
tol: float = 1e-8,
):
"""Logistic regression via IRLS with optional L2 penalty (ridge).
alpha: L2 strength (penalizes coefficients except intercept).
"""
X = np.asarray(X, dtype=float)
y = np.asarray(y, dtype=float)
n, d = X.shape
beta = np.zeros(d, dtype=float)
losses = []
reg = alpha * np.eye(d)
reg[0, 0] = 0.0
eps = 1e-12
for _ in range(max_iter):
eta = X @ beta
mu = expit(eta)
w = mu * (1.0 - mu)
w = np.clip(w, 1e-9, None)
z = eta + (y - mu) / w
sqrt_w = np.sqrt(w)
Xw = X * sqrt_w[:, None]
zw = z * sqrt_w
A = Xw.T @ Xw + reg
b = Xw.T @ zw
beta_new = np.linalg.solve(A, b)
nll = -np.mean(y * np.log(mu + eps) + (1.0 - y) * np.log(1.0 - mu + eps))
penalty = 0.5 * alpha * np.sum(beta[1:] ** 2)
losses.append(float(nll + penalty))
if np.linalg.norm(beta_new - beta) <= tol * (np.linalg.norm(beta) + tol):
beta = beta_new
break
beta = beta_new
return beta, np.array(losses)
# Logistic regression demo (2D)
X, y = make_blobs(
n_samples=700,
centers=[(-2.0, -1.0), (2.0, 1.0)],
cluster_std=[1.6, 1.4],
random_state=7,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7, stratify=y)
fig = px.scatter(
x=X_tr[:, 0],
y=X_tr[:, 1],
color=y_tr.astype(str),
title="Binary classification dataset",
labels={"x": "x1", "y": "x2", "color": "class"},
)
fig.update_traces(marker=dict(size=6, opacity=0.7))
fig.update_layout(width=720, height=470)
fig.show()
X_tr_i = add_intercept(X_tr)
X_te_i = add_intercept(X_te)
beta_hat, losses = irls_logistic_regression(X_tr_i, y_tr, alpha=0.5, max_iter=80)
# sklearn reference
sk = LogisticRegression(C=1/0.5, penalty="l2", solver="lbfgs", fit_intercept=True, max_iter=2000)
sk.fit(X_tr, y_tr)
pred_scratch = (expit(X_te_i @ beta_hat) >= 0.5).astype(int)
pred_sklearn = sk.predict(X_te)
print("Scratch IRLS logistic accuracy:", accuracy_score(y_te, pred_scratch))
print("sklearn LogisticRegression accuracy:", accuracy_score(y_te, pred_sklearn))
print("Scratch beta:", beta_hat)
print("sklearn beta:", np.r_[sk.intercept_, sk.coef_.ravel()])
Scratch IRLS logistic accuracy: 0.9666666666666667
sklearn LogisticRegression accuracy: 0.9666666666666667
Scratch beta: [-0.0781 1.8208 1.0538]
sklearn beta: [-0.0781 1.8208 1.0538]
# Convergence plot
fig = go.Figure()
fig.add_trace(go.Scatter(y=losses, x=np.arange(1, len(losses) + 1), mode="lines+markers"))
fig.update_layout(title="IRLS convergence (avg NLL + ridge)", xaxis_title="iteration", yaxis_title="loss", width=700, height=420)
fig.show()
def plot_probability_surface_2d(model_proba_fn, X, y, title: str, grid_steps: int = 220):
x_min, x_max = X[:, 0].min() - 1.0, X[:, 0].max() + 1.0
y_min, y_max = X[:, 1].min() - 1.0, X[:, 1].max() + 1.0
xs = np.linspace(x_min, x_max, grid_steps)
ys = np.linspace(y_min, y_max, grid_steps)
xx, yy = np.meshgrid(xs, ys)
grid = np.c_[xx.ravel(), yy.ravel()]
proba = model_proba_fn(grid).reshape(xx.shape)
fig = go.Figure()
fig.add_trace(go.Contour(
x=xs,
y=ys,
z=proba,
colorscale="RdBu",
opacity=0.75,
contours=dict(showlines=False),
colorbar=dict(title="P(class=1)"),
))
fig.add_trace(go.Scatter(
x=X[:, 0],
y=X[:, 1],
mode="markers",
marker=dict(color=y, colorscale="Viridis", size=6, line=dict(width=0.5, color="white")),
name="data",
))
fig.update_layout(title=title, width=760, height=520)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
return fig
fig1 = plot_probability_surface_2d(lambda Z: expit(add_intercept(Z) @ beta_hat), X_te, y_te, "Scratch logistic: P(class=1)")
fig1.show()
fig2 = plot_probability_surface_2d(lambda Z: sk.predict_proba(Z)[:, 1], X_te, y_te, "sklearn logistic: P(class=1)")
fig2.show()
5) Poisson regression (counts) as a GLM#
When \(y\) is a count, a natural model is Poisson:
\[y \mid x \sim \mathrm{Poisson}(\mu(x))\]
A common link is the log link:
\[\eta = \log\mu \quad\Longleftrightarrow\quad \mu = e^{\eta}\]
This guarantees \(\mu>0\).
IRLS exists here too and is very similar to logistic regression.
def irls_poisson_regression(
X: np.ndarray,
y: np.ndarray,
alpha: float = 0.0,
max_iter: int = 80,
tol: float = 1e-8,
):
"""Poisson regression via IRLS with optional L2 penalty (ridge).
Uses log link: μ = exp(η).
alpha penalizes coefficients except intercept.
"""
X = np.asarray(X, dtype=float)
y = np.asarray(y, dtype=float)
n, d = X.shape
beta = np.zeros(d, dtype=float)
losses = []
reg = alpha * np.eye(d)
reg[0, 0] = 0.0
eps = 1e-12
for _ in range(max_iter):
eta = X @ beta
mu = np.exp(eta)
w = np.clip(mu, 1e-9, None)
z = eta + (y - mu) / mu
sqrt_w = np.sqrt(w)
Xw = X * sqrt_w[:, None]
zw = z * sqrt_w
A = Xw.T @ Xw + reg
b = Xw.T @ zw
beta_new = np.linalg.solve(A, b)
nll = float(np.mean(mu - y * np.log(mu + eps)))
penalty = 0.5 * alpha * float(np.sum(beta[1:] ** 2))
losses.append(nll + penalty)
if np.linalg.norm(beta_new - beta) <= tol * (np.linalg.norm(beta) + tol):
beta = beta_new
break
beta = beta_new
return beta, np.array(losses)
# Synthetic Poisson regression
n = 1500
Xp = rng.normal(size=(n, 2))
Xp_i = add_intercept(Xp)
beta_true = np.array([0.3, 0.6, -0.4])
rate = np.exp(Xp_i @ beta_true)
yp = rng.poisson(lam=rate)
Xp_tr, Xp_te, yp_tr, yp_te = train_test_split(Xp, yp, test_size=0.3, random_state=7)
Xp_tr_i = add_intercept(Xp_tr)
Xp_te_i = add_intercept(Xp_te)
beta_p, losses_p = irls_poisson_regression(Xp_tr_i, yp_tr, alpha=1.0)
sk_p = PoissonRegressor(alpha=1.0, fit_intercept=True, max_iter=5000)
sk_p.fit(Xp_tr, yp_tr)
print("True beta:", beta_true)
print("Scratch beta:", beta_p)
print("sklearn beta:", np.r_[sk_p.intercept_, sk_p.coef_])
True beta: [ 0.3 0.6 -0.4]
Scratch beta: [ 0.3242 0.6087 -0.3772]
sklearn beta: [ 0.4849 0.3797 -0.2432]
# Predict and compare
mu_hat_scratch = np.exp(Xp_te_i @ beta_p)
mu_hat_sklearn = sk_p.predict(Xp_te)
rmse_scratch = float(np.sqrt(mean_squared_error(yp_te, mu_hat_scratch)))
rmse_sklearn = float(np.sqrt(mean_squared_error(yp_te, mu_hat_sklearn)))
fig = go.Figure()
fig.add_trace(go.Scatter(x=mu_hat_scratch, y=yp_te, mode="markers", name=f"scratch (RMSE={rmse_scratch:.2f})", opacity=0.6))
fig.add_trace(go.Scatter(x=mu_hat_sklearn, y=yp_te, mode="markers", name=f"sklearn (RMSE={rmse_sklearn:.2f})", opacity=0.6))
fig.update_layout(
title="Poisson regression: predicted mean μ vs observed count y",
xaxis_title="predicted μ",
yaxis_title="observed y",
width=900,
height=500,
)
fig.show()
6) Gamma / Inverse Gaussian (positive, skewed targets)#
When \(y>0\) and is skewed, Gaussian noise is often a poor match.
Two classic EDM choices:
Gamma: variance grows like \(\mu^2\) (common for positive skew)
Inverse Gaussian: even heavier variance growth
In scikit-learn, these show up via:
GammaRegressor (Gamma deviance)
TweedieRegressor(power=3) (Inverse Gaussian deviance)
Tweedie family connection#
A convenient unification is the Tweedie variance function:
\[\operatorname{Var}(y) = \phi\,\mu^{p}\]
\(p=0\) → Normal
\(p=1\) → Poisson
\(p=2\) → Gamma
\(p=3\) → Inverse Gaussian
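The variance pattern \(\operatorname{Var}(y)=\phi\,\mu^p\) is easy to check empirically with simulated draws (a sketch; the sample size and seed are arbitrary): Poisson has \(p=1\) with \(\phi=1\), while Gamma with shape \(k\) has \(p=2\) with \(\phi=1/k\).

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 4.0

# Poisson: Var(y) = mu, i.e. p=1, phi=1
y_pois = rng.poisson(lam=mu, size=200_000)
print(y_pois.var() / mu)        # ~1

# Gamma with shape k: Var(y) = mu^2 / k, i.e. p=2, phi=1/k
k = 2.0
y_gamma = rng.gamma(shape=k, scale=mu / k, size=200_000)
print(y_gamma.var() / mu**2)    # ~0.5
```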
# A synthetic positive-skew regression dataset
n = 1800
Xg = rng.normal(size=(n, 2))
Xg_i = add_intercept(Xg)
beta_true = np.array([0.2, 0.5, -0.3])
mu_true = np.exp(Xg_i @ beta_true)
# Gamma sample with mean mu_true: choose shape k, scale=mu/k
k_shape = 2.0
scale = mu_true / k_shape
y_gamma = rng.gamma(shape=k_shape, scale=scale)
Xg_tr, Xg_te, yg_tr, yg_te = train_test_split(Xg, y_gamma, test_size=0.3, random_state=7)
m_gamma = GammaRegressor(alpha=1.0, fit_intercept=True, max_iter=5000)
m_gamma.fit(Xg_tr, yg_tr)
m_ig = TweedieRegressor(power=3.0, alpha=1.0, link='log', fit_intercept=True, max_iter=5000)
m_ig.fit(Xg_tr, yg_tr)
pred_gamma = m_gamma.predict(Xg_te)
pred_ig = m_ig.predict(Xg_te)
fig = go.Figure()
fig.add_trace(go.Scatter(x=pred_gamma, y=yg_te, mode="markers", name="GammaRegressor", opacity=0.6))
fig.add_trace(go.Scatter(x=pred_ig, y=yg_te, mode="markers", name="Tweedie power=3 (InvGauss)", opacity=0.6))
fig.update_layout(title="Positive skew regression: predictions vs observed", xaxis_title="predicted mean", yaxis_title="observed y", width=900, height=520)
fig.show()
print("GammaRegressor coef:", np.r_[m_gamma.intercept_, m_gamma.coef_])
print("InvGauss(Tweedie p=3) coef:", np.r_[m_ig.intercept_, m_ig.coef_])
GammaRegressor coef: [ 0.2419 0.2458 -0.1576]
InvGauss(Tweedie p=3) coef: [ 0.1594 0.2339 -0.1522]
7) Multiclass (categorical) GLM intuition#
For \(K\) classes, you can model one linear score per class and map the scores to probabilities with a softmax:
\[P(y = k \mid x) = \frac{e^{\eta_k}}{\sum_{j=1}^{K} e^{\eta_j}}, \qquad \eta_k = x^\top\beta_k\]
This is a categorical (multinomial) GLM. The deviance corresponds to cross-entropy.
In practice, sklearn.linear_model.LogisticRegression with a multinomial-capable solver implements this.
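A tiny sketch of the softmax link itself (using scipy.special.softmax): pinning one class's score at zero shows that the binary sigmoid is just the \(K=2\) special case.

```python
import numpy as np
from scipy.special import expit, softmax

# Stack one linear score per class; here class 0's score is pinned at 0,
# so only the relative score eta of class 1 matters.
eta = np.array([-2.0, 0.0, 3.0])
scores = np.c_[np.zeros_like(eta), eta]
p = softmax(scores, axis=1)

print(p.sum(axis=1))                     # each row sums to 1
print(np.allclose(p[:, 1], expit(eta)))  # True: softmax reduces to sigmoid
```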
# Quick multiclass demo (sklearn)
X3, y3 = make_blobs(
n_samples=900,
centers=[(-2, 0), (2, 0), (0, 2)],
cluster_std=[1.1, 1.1, 1.1],
random_state=7,
)
X3_tr, X3_te, y3_tr, y3_te = train_test_split(X3, y3, test_size=0.3, random_state=7, stratify=y3)
m_multi = LogisticRegression(solver='lbfgs', max_iter=5000)
m_multi.fit(X3_tr, y3_tr)
print("Multiclass accuracy:", accuracy_score(y3_te, m_multi.predict(X3_te)))
fig = px.scatter(x=X3_te[:, 0], y=X3_te[:, 1], color=y3_te.astype(str), title="3-class dataset", labels={"x":"x1","y":"x2","color":"class"})
fig.update_traces(marker=dict(size=6, opacity=0.7))
fig.update_layout(width=720, height=470)
fig.show()
Multiclass accuracy: 0.8925925925925926
8) Summary#
GLMs keep the linear predictor \(\eta=X\beta\), but change the mapping to the mean and the loss.
The distribution choice gives you the right domain and right notion of error (deviance).
IRLS gives an intuitive optimization story: repeatedly solve a weighted least-squares problem.
Exercises#
Implement IRLS for Gamma with log link (harder but very educational).
Create a dataset with over-dispersed counts and compare Poisson vs a Tweedie model.
Change alpha and observe coefficient shrinkage for PoissonRegressor.
References#
scikit-learn: GLM and Tweedie regressors
“Generalized Linear Models” (McCullagh & Nelder)